Working with text data

Author

Philipp Heinrich, Andreas Blombach

Published

February 16, 2023

library(data.table)
library(tidyverse)
library(tidytext)
library(NLP)
library(openNLP) # needs Java JDK: https://www.oracle.com/java/technologies/downloads/
library(corpustools)
library(topicmodels)
library(ggwordcloud)
library(wordcloud)

SMS Spam Collection

We’ll work with a small SMS corpus (publicly available at http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/) containing 5,574 messages which have been labelled as either spam or legitimate (ham). Thus, it is an interesting data set for feature extraction and machine learning: How can we best predict whether a message is spam or not?

(NB: Many ham messages come from Singapore. If a certain word confuses you (such as la, mah or lor), it’s probably Singlish: https://en.wikipedia.org/wiki/Singlish_vocabulary)

Before we can predict anything, however, we’ll have to read in the data and learn how to handle text data in R.

smsspam <- fread(
  "../data/smsspam.tsv", # path to file
  quote = "", #
  header = FALSE, # first row doesn't contain column names
  col.names = c("y", "text"), # name the two columns,
  encoding = "UTF-8" # specify encoding (some problems remain)
)

# this would also work:
# test <- read_tsv(
#   "data/smsspam.tsv",
#   quote = "",
#   col_names = c("y", "text")
# )

str(smsspam)
Classes 'data.table' and 'data.frame':  5574 obs. of  2 variables:
 $ y   : chr  "ham" "ham" "spam" "ham" ...
 $ text: chr  "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..." "Ok lar... Joking wif u oni..." "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question("| __truncated__ "U dun say so early hor... U c already then say..." ...
 - attr(*, ".internal.selfref")=<externalptr> 

Basic string manipulation

Here, we’ll take a look at common string functions, first in base R, then in the Tidyverse (which might be a little more intuitive, or, at the very least, a little more tidy).

In order not to be overwhelmed by the output of many functions, we’ll define two smaller objects to work with: texts, a character vector containing the first 10 texts from the SMS corpus, and text, the first of these texts.

texts <- head(smsspam$text, 10)
texts
 [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."                                                 
 [2] "Ok lar... Joking wif u oni..."                                                                                                                                   
 [3] "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"     
 [4] "U dun say so early hor... U c already then say..."                                                                                                               
 [5] "Nah I don't think he goes to usf, he lives around here though"                                                                                                   
 [6] "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"             
 [7] "Even my brother is not like to speak with me. They treat me like aids patent."                                                                                   
 [8] "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"
 [9] "WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."   
[10] "Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030"      
text <- texts[2]
text
[1] "Ok lar... Joking wif u oni..."

Base R

Number of characters (length)

We’ve already used nchar() before. By default, this gives us the number of individual characters in a string (or character, but you can see why R’s terminology might be confusing here). As the function is vectorised, it can also be used to get the lengths of all strings in a vector:

nchar(text)
[1] 29
nchar(texts)
 [1] 111  29 155  49  61 147  77 160 157 154

Note that you can also use the optional parameter type to get the number of bytes needed to store the string:

"\u2640"
[1] "♀"
nchar("\u2640", type = "bytes")
[1] 3

Substrings

To get a substring (part of a string), you need to specify start and end position (remember that in R, you start counting at 1, not at 0):

substr(text, start = 11, stop = 16)  # select
[1] "Joking"

You can also assign a new value to a substring, thereby changing the original string:

tmp <- text
substr(tmp, start = 11, stop = 16) <- "Kidding"
tmp
[1] "Ok lar... Kiddin wif u oni..."

What happens if the replacement value and the substring to be replaced differ in length?
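A quick experiment shows that substr()<- never changes the length of the target string: a shorter replacement overwrites only as many characters as it provides, and a longer one is truncated to fit the specified span.

```r
tmp <- "Ok lar... Joking wif u oni..."
substr(tmp, start = 11, stop = 16) <- "Ha" # shorter than the 6-character span
tmp # only "Jo" is overwritten, "king" remains

tmp <- "Ok lar... Joking wif u oni..."
substr(tmp, start = 11, stop = 16) <- "Kidding around" # longer than the span
tmp # truncated to the first 6 characters, "Kiddin"
```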

substr() also works on vectors:

substr(texts, start = 1, stop = 10)
 [1] "Go until j" "Ok lar... " "Free entry" "U dun say " "Nah I don'"
 [6] "FreeMsg He" "Even my br" "As per you" "WINNER!! A" "Had your m"

strsplit() can be used to, well, split strings at the position where a specified string (or regular expression) occurs.

Note that this function always yields a list:

strsplit(text, " ")  # split
[[1]]
[1] "Ok"     "lar..." "Joking" "wif"    "u"      "oni..."

You can use unlist() to simplify a list to a vector of all atomic elements contained in the list:

strsplit(text, " ") |> unlist()
[1] "Ok"     "lar..." "Joking" "wif"    "u"      "oni..."

The list output makes more sense when applying strsplit() to a vector:

strsplit(texts[1:2], " ")
[[1]]
 [1] "Go"        "until"     "jurong"    "point,"    "crazy.."   "Available"
 [7] "only"      "in"        "bugis"     "n"         "great"     "world"    
[13] "la"        "e"         "buffet..." "Cine"      "there"     "got"      
[19] "amore"     "wat..."   

[[2]]
[1] "Ok"     "lar..." "Joking" "wif"    "u"      "oni..."

Concatenation and insertion

To join strings (or vectors of strings), you can use paste():

paste("sms", text, sep = " : ")
[1] "sms : Ok lar... Joking wif u oni..."
paste("sms", texts, sep = " : ")
 [1] "sms : Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."                                                 
 [2] "sms : Ok lar... Joking wif u oni..."                                                                                                                                   
 [3] "sms : Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"     
 [4] "sms : U dun say so early hor... U c already then say..."                                                                                                               
 [5] "sms : Nah I don't think he goes to usf, he lives around here though"                                                                                                   
 [6] "sms : FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"             
 [7] "sms : Even my brother is not like to speak with me. They treat me like aids patent."                                                                                   
 [8] "sms : As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"
 [9] "sms : WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."   
[10] "sms : Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030"      
paste(head(smsspam$y, 10), ": ", texts, sep = "")
 [1] "ham: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."                                                 
 [2] "ham: Ok lar... Joking wif u oni..."                                                                                                                                   
 [3] "spam: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"    
 [4] "ham: U dun say so early hor... U c already then say..."                                                                                                               
 [5] "ham: Nah I don't think he goes to usf, he lives around here though"                                                                                                   
 [6] "spam: FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"            
 [7] "ham: Even my brother is not like to speak with me. They treat me like aids patent."                                                                                   
 [8] "ham: As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"
 [9] "spam: WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."  
[10] "spam: Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030"     

To insert values at specific places in a string, you can use a wrapper for the C function sprintf(). In the following example, "%s: %s" is a string wherein %s functions as a string placeholder to be replaced by the value(s) of the corresponding variable:

sprintf("%s: %s", head(smsspam$y, 10), texts)
 [1] "ham: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."                                                 
 [2] "ham: Ok lar... Joking wif u oni..."                                                                                                                                   
 [3] "spam: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"    
 [4] "ham: U dun say so early hor... U c already then say..."                                                                                                               
 [5] "ham: Nah I don't think he goes to usf, he lives around here though"                                                                                                   
 [6] "spam: FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"            
 [7] "ham: Even my brother is not like to speak with me. They treat me like aids patent."                                                                                   
 [8] "ham: As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"
 [9] "spam: WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."  
[10] "spam: Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030"     

See the documentation (?sprintf) for details and other placeholders.
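Besides %s, sprintf() supports the usual C-style conversions, for example:

```r
sprintf("%d messages", 5574L) # %d: integer
sprintf("%.2f pounds", 1.5)   # %.2f: fixed-point number with two decimals
sprintf("%05d", 42L)          # zero-padded to width 5
sprintf("%10s|", "right")     # right-aligned in a field of width 10
```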

Case

Use tolower() and toupper() to convert a string to lower or to upper case, respectively.

tolower(text)
[1] "ok lar... joking wif u oni..."
toupper(text)
[1] "OK LAR... JOKING WIF U ONI..."

Searching with regular expressions

grep() returns a vector of the indices of strings in a vector that yielded a match for the given regular expression. Applying it to a single string is thus not very useful:

grep("[[:digit:]]", texts[1])
integer(0)
grep("[[:digit:]]", texts[3])
[1] 1
grep("[[:digit:]]", texts)
[1]  3  6  8  9 10

To get the strings containing matches, you can either use the returned indices to subset the original vector or just set value to TRUE:

# texts[grep("[[:digit:]]", texts)]
grep("[[:digit:]]", texts, value = TRUE)
[1] "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"     
[2] "FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv"             
[3] "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune"
[4] "WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only."   
[5] "Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030"      

grepl() will return TRUE for each string in a vector that yielded a match:

grepl("[[:digit:]]", texts)
 [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
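Base R can also extract the matched substrings themselves by combining regmatches() with regexpr() (first match per string) or gregexpr() (all matches, returned as a list):

```r
x <- "Text FA to 87121, call 09061701461"
regmatches(x, regexpr("[0-9]+", x))  # first match: "87121"
regmatches(x, gregexpr("[0-9]+", x)) # all matches (a list)
```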

As of R version 4.0, you can use raw strings. Although they are a little more awkward than in Python, this still makes it easier to escape special characters in more complicated regular expressions.

A raw string has an r in front of the opening (single or double) quotation mark and round brackets inside the quotation marks: r"(...)" (where ... stands for any character sequence).

grepl("\\{", "this string contains {curly} brackets") # regular string needs double backslash to escape 
[1] TRUE
grepl(r"(\{)", "this string contains {curly} brackets") # raw string only needs a single backslash to escape
[1] TRUE
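If the string itself contains the sequence )", the delimiters can be varied: square or curly brackets work as well, and dashes can be added between the quotation mark and the bracket:

```r
r"{a regex with (round) brackets: \(\d+\)}" # curly delimiters
r"--(this one even contains )" itself)--"   # dashes disambiguate the closer
```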

Replacing with regular expressions

gsub("[[:digit:]]", 0, texts[1])
[1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."
gsub("[[:digit:]]", 0, texts[3])
[1] "Free entry in 0 a wkly comp to win FA Cup final tkts 00st May 0000. Text FA to 00000 to receive entry question(std txt rate)T&C's apply 00000000000over00's"
gsub("[[:digit:]]", 0, texts)
 [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."                                                 
 [2] "Ok lar... Joking wif u oni..."                                                                                                                                   
 [3] "Free entry in 0 a wkly comp to win FA Cup final tkts 00st May 0000. Text FA to 00000 to receive entry question(std txt rate)T&C's apply 00000000000over00's"     
 [4] "U dun say so early hor... U c already then say..."                                                                                                               
 [5] "Nah I don't think he goes to usf, he lives around here though"                                                                                                   
 [6] "FreeMsg Hey there darling it's been 0 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £0.00 to rcv"             
 [7] "Even my brother is not like to speak with me. They treat me like aids patent."                                                                                   
 [8] "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *0 to copy your friends Callertune"
 [9] "WINNER!! As a valued network customer you have been selected to receivea £000 prize reward! To claim call 00000000000. Claim code KL000. Valid 00 hours only."   
[10] "Had your mobile 00 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 00000000000"      
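The g in gsub() stands for “global”: it replaces all matches, whereas its sibling sub() only replaces the first one:

```r
sub("[[:digit:]]", "0", "call 123 or 456")  # first digit only
gsub("[[:digit:]]", "0", "call 123 or 456") # all digits
```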

Tidyverse (stringr)

The main advantage of functions from stringr is probably that they are easier to find since they all start with str_.

stringr is built on top of stringi, a comprehensive package that contains many more string functions – stringr focusses on the most common tasks.

DataCamp offers a whole course on string manipulation with stringr: https://learn.datacamp.com/courses/string-manipulation-with-stringr-in-r

String length

str_length(text)
[1] 29
str_length(texts)
 [1] 111  29 155  49  61 147  77 160 157 154

One caveat: Unicode allows characters to be combined. For example, the umlaut ü can be thought of as a combination of u (U+0075) and a diacritic (U+0308):

weird_umlaut <- "\u0075\u0308" # Apple users tend to produce these ...
weird_umlaut # NB: will be displayed as two characters in HTML output
[1] "ü"

But there is also a single character which looks exactly the same:

regular_umlaut <- "\u00fc"
regular_umlaut
[1] "ü"

These two code sequences are called canonically equivalent. But although they look the same, they are not:

regular_umlaut == weird_umlaut
[1] FALSE

This may also result in unexpected string lengths:

umlauts <- c(regular_umlaut, weird_umlaut)
str_length(umlauts) # same for nchar()
[1] 1 2
str_count(umlauts) # what we'd expect
[1] 1 1

Luckily, there are different Unicode normalisation forms (see https://unicode.org/reports/tr15/) to handle canonical equivalence (and the weaker compatibility equivalence of characters which represent the same abstract character but are displayed differently, e.g. ² and 2).

NFC is the normalisation form for canonical decomposition, followed by canonical composition: If a combination of Unicode characters can be represented by a single Unicode character that looks the same, that’s the way to go.

When dealing with text data from multiple sources (e.g. text scraped from different web pages), you may want to perform Unicode normalisation to ensure you don’t run into problems later on.

Check if strings are in NFC form:

stringi::stri_trans_isnfc(umlauts)
[1]  TRUE FALSE

NFC normalisation:

umlauts <- stringi::stri_trans_nfc(umlauts)
umlauts[1] == umlauts[2]
[1] TRUE
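stringi also implements the compatibility forms: NFKC, for example, additionally resolves compatibility equivalence and maps the superscript ² mentioned above to a plain 2 (note that this is lossy, so apply it deliberately):

```r
stringi::stri_trans_nfc("\u00b2")  # NFC leaves the superscript alone
stringi::stri_trans_nfkc("\u00b2") # NFKC turns it into a plain "2"
```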

Substrings

str_sub() works just like substr(), but you can also use negative values to count from the end of the string.

str_sub(text, start = 1L, end = 16L)
[1] "Ok lar... Joking"
str_sub(text, start = -19L, end = -1L)
[1] "Joking wif u oni..."

Replacement works almost the same way:

tmp <- text
str_sub(tmp, start = 11L, end = 16L) <- "Kidding"
tmp
[1] "Ok lar... Kidding wif u oni..."

The difference is that the replacement string will always be inserted completely.

Splitting strings with str_split() works mostly the same as with strsplit(). You just get a few additional options, like simplify = TRUE to return a matrix instead of a list.

str_split(texts[1:2], " ")
[[1]]
 [1] "Go"        "until"     "jurong"    "point,"    "crazy.."   "Available"
 [7] "only"      "in"        "bugis"     "n"         "great"     "world"    
[13] "la"        "e"         "buffet..." "Cine"      "there"     "got"      
[19] "amore"     "wat..."   

[[2]]
[1] "Ok"     "lar..." "Joking" "wif"    "u"      "oni..."

Concatenation and insertion

To join strings, use str_c(). The default value of sep is an empty string.

str_c(head(smsspam$y, 3), ": ", texts[1:3])
[1] "ham: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."                                             
[2] "ham: Ok lar... Joking wif u oni..."                                                                                                                               
[3] "spam: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's"

You can “flatten” a vector of strings into a single string using str_flatten():

str_flatten(texts, "\n\n")
[1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\n\nOk lar... Joking wif u oni...\n\nFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\n\nU dun say so early hor... U c already then say...\n\nNah I don't think he goes to usf, he lives around here though\n\nFreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv\n\nEven my brother is not like to speak with me. They treat me like aids patent.\n\nAs per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune\n\nWINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.\n\nHad your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030"

You can use str_glue() to insert variable values into strings:

str_glue("{head(smsspam$y, 3)}: {texts[1:3]}")
ham: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham: Ok lar... Joking wif u oni...
spam: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's

Named arguments are also possible:

str_glue("{label}: {text}",
         label = head(smsspam$y, 3),
         text = texts[1:3])
ham: Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham: Ok lar... Joking wif u oni...
spam: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's

Case

To convert a string to lower or upper case, use str_to_lower() or str_to_upper(). There are also functions for sentence and title case; see ?case.

Removing whitespace

To remove whitespace from start and end of strings, use str_trim(). To also remove repeated whitespace inside of strings, use str_squish().

stupid_string <- " \t some  words\t\tand\n stuff   "
str_trim(stupid_string)
[1] "some  words\t\tand\n stuff"
str_squish(stupid_string)
[1] "some words and stuff"

Searching with regular expressions

str_count() offers a nice way to count the number of occurrences of a specific pattern in a string.

How many vowels are there in text? (Linguistically naïve, but still …)

str_count(text, "[aeiouAEIOU]")
[1] 8

A very simple word count using a regular expression for alphanumeric characters (letters and digits):

str_count(texts, "[[:alnum:]]+") # simple word count
 [1] 20  6 33 11 14 36 16 26 26 29

str_detect() tells you if a string contains a given pattern:

str_detect(texts, "^[fF]ree") # does the string start with "Free" or "free"?
 [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

This is often useful to filter data sets:

smsspam |> filter(str_detect(text, "[0-9]{5,}")) # texts containing at least 5 consecutive digits
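For plain character vectors (without a surrounding data frame), str_subset() combines detection and subsetting in a single step:

```r
library(stringr)
x <- c("call 09061701461", "see you later", "text 87121 now")
str_subset(x, "[0-9]{5,}") # keeps strings with at least 5 consecutive digits
```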

str_extract() extracts the first matching string:

str_extract(texts, "[[:alnum:]]+")
 [1] "Go"      "Ok"      "Free"    "U"       "Nah"     "FreeMsg" "Even"   
 [8] "As"      "WINNER"  "Had"    

str_extract_all() extracts all matching strings (and therefore returns a list):

str_extract_all(texts[1:3], "[[:alnum:]]+")
[[1]]
 [1] "Go"        "until"     "jurong"    "point"     "crazy"     "Available"
 [7] "only"      "in"        "bugis"     "n"         "great"     "world"    
[13] "la"        "e"         "buffet"    "Cine"      "there"     "got"      
[19] "amore"     "wat"      

[[2]]
[1] "Ok"     "lar"    "Joking" "wif"    "u"      "oni"   

[[3]]
 [1] "Free"              "entry"             "in"               
 [4] "2"                 "a"                 "wkly"             
 [7] "comp"              "to"                "win"              
[10] "FA"                "Cup"               "final"            
[13] "tkts"              "21st"              "May"              
[16] "2005"              "Text"              "FA"               
[19] "to"                "87121"             "to"               
[22] "receive"           "entry"             "question"         
[25] "std"               "txt"               "rate"             
[28] "T"                 "C"                 "s"                
[31] "apply"             "08452810075over18" "s"                

Replacing with regular expressions

str_replace() does exactly what it promises, but only for the first matching occurrence:

str_replace(text, "i", "I")
[1] "Ok lar... JokIng wif u oni..."

If you want to replace all occurrences of a given pattern, use str_replace_all() instead:

str_replace_all(texts, "\\b[0-9]+\\b", "NUMBER")
 [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat..."                                                      
 [2] "Ok lar... Joking wif u oni..."                                                                                                                                        
 [3] "Free entry in NUMBER a wkly comp to win FA Cup final tkts 21st May NUMBER. Text FA to NUMBER to receive entry question(std txt rate)T&C's apply 08452810075over18's"  
 [4] "U dun say so early hor... U c already then say..."                                                                                                                    
 [5] "Nah I don't think he goes to usf, he lives around here though"                                                                                                        
 [6] "FreeMsg Hey there darling it's been NUMBER week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £NUMBER.NUMBER to rcv"    
 [7] "Even my brother is not like to speak with me. They treat me like aids patent."                                                                                        
 [8] "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *NUMBER to copy your friends Callertune"
 [9] "WINNER!! As a valued network customer you have been selected to receivea £NUMBER prize reward! To claim call NUMBER. Claim code KL341. Valid NUMBER hours only."      
[10] "Had your mobile NUMBER months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on NUMBER"            

Grouping and backreferences work, but we won’t go that far now.
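For the curious, a single minimal example anyway: groups are formed with parentheses and referenced in the replacement as \1, \2 and so on:

```r
library(stringr)
str_replace("Joking wif u", "(\\w+) (\\w+)", "\\2 \\1") # swaps the first two words
```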

Packages: NLP & openNLP

openNLP provides an interface to the Apache OpenNLP tools, a Java toolkit for typical NLP (natural language processing) tasks such as tokenisation, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, language detection and coreference resolution (https://opennlp.apache.org).

NLP has a special class for strings:

txt <- as.String(texts)

This makes it possible to access substrings using square brackets:

txt[1, 21]
Go until jurong point

Language settings

While English is available by default, you will need additional model files if you want to handle text in other languages. Pre-trained models are available here: http://opennlp.sourceforge.net/models-1.5/

You can conveniently install these from the repository at https://datacube.wu.ac.at. For example, if we wanted models for German, we’d do this:

install.packages("openNLPmodels.de", repos = "http://datacube.wu.ac.at/", type = "source")

Then, in a given openNLP function requiring a model, you’d set the parameter language to "de". For example:

Maxent_Sent_Token_Annotator(language = "de")

In case there’s more than one model available for a single component, such as the POS-Tagger, you can use the model parameter to specify which model you’d like to use:

Maxent_POS_Tag_Annotator(model = "path/to/Rlibrary/openNLPmodels.de/models/de-pos-perceptron.bin")

Note: Even if a model is available for a particular language and component, performance on your own data may be poor, e.g. because it differs greatly from the data on which the model was trained (usually specified in the model’s description). In this case, check for better models elsewhere, train a model yourself or ask a computational linguist for help!

Sentence annotation

This first step is needed to identify sentence boundaries.

We will create an object using the Maxent_Sent_Token_Annotator() function, which we will then use as an argument in the annotate() function (from NLP).

sentence_annotator <- Maxent_Sent_Token_Annotator()

sentence_annotation <- annotate(txt, sentence_annotator)
sentence_annotation
 id type     start end  features
  1 sentence     1  209 
  2 sentence   211  476 
  3 sentence   478  515 
  4 sentence   517  603 
  5 sentence   605  635 
  6 sentence   637  755 
  7 sentence   757  805 
  8 sentence   807  888 
  9 sentence   890  915 
 10 sentence   917  933 
 11 sentence   935  954 
 12 sentence   956  989 
 13 sentence   991 1063 
 14 sentence  1065 1109 

If we check these boundaries, we can already see that sentence detection doesn’t work well with our SMS data:

txt[sentence_annotation]
 [1] "Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...\nOk lar... Joking wif u oni...\nFree entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005."                                                          
 [2] "Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's\nU dun say so early hor... U c already then say...\nNah I don't think he goes to usf, he lives around here though\nFreeMsg Hey there darling it's been 3 week's now and no word back!"
 [3] "I'd like some fun you up for it still?"                                                                                                                                                                                                                                       
 [4] "Tb ok! XxX std chgs to send, £1.50 to rcv\nEven my brother is not like to speak with me."                                                                                                                                                                                     
 [5] "They treat me like aids patent."                                                                                                                                                                                                                                              
 [6] "As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers."                                                                                                                                                      
 [7] "Press *9 to copy your friends Callertune\nWINNER!!"                                                                                                                                                                                                                           
 [8] "As a valued network customer you have been selected to receivea £900 prize reward!"                                                                                                                                                                                           
 [9] "To claim call 09061701461."                                                                                                                                                                                                                                                   
[10] "Claim code KL341."                                                                                                                                                                                                                                                            
[11] "Valid 12 hours only."                                                                                                                                                                                                                                                         
[12] "Had your mobile 11 months or more?"                                                                                                                                                                                                                                           
[13] "U R entitled to Update to the latest colour mobiles with camera for Free!"                                                                                                                                                                                                    
[14] "Call The Mobile Update Co FREE on 08002986030"                                                                                                                                                                                                                                

Line breaks were introduced by converting a vector of texts to a String object (txt). However, the sentence detector doesn’t properly recognise these as sentence boundaries. The same goes for the ellipses (“dot-dot-dot”) in several messages.

We probably need a better model for this (trained on SMS or CMC data), but if we aren’t interested in punctuation marks later on, we could also replace parts of our input (generally, we wouldn’t recommend this):

txt <- texts[1:5] |>
  str_replace_all("\\.{2,}", "\\.") |> # replace several dots by single dots
  str_flatten(collapse = "</sms> ") |> # combine SMS into single text
  str_replace_all("([[:alnum:]])</sms>", "\\1.") |> # replace SMS endings without punctuation marks with dots
  str_remove_all("</sms>") |> # clean up
  as.String()
sentence_annotation <- annotate(txt, sentence_annotator)
txt[sentence_annotation]
 [1] "Go until jurong point, crazy."                                                           
 [2] "Available only in bugis n great world la e buffet."                                      
 [3] "Cine there got amore wat."                                                               
 [4] "Ok lar."                                                                                 
 [5] "Joking wif u oni."                                                                       
 [6] "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005."                     
 [7] "Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's."
 [8] "U dun say so early hor."                                                                 
 [9] "U c already then say."                                                                   
[10] "Nah I don't think he goes to usf, he lives around here though"                           

This looks better.

If we were to do sentence boundary detection on single SMS texts, we wouldn’t have to care about the issue regarding line breaks.

In general: If you notice errors in early annotation steps, think about error propagation. In an NLP pipeline (sentence boundary detection -> tokenisation -> PoS tagging -> dependency parsing -> …), the accuracy of later steps depends upon the accuracy of earlier steps. If a text hasn’t been properly tokenised, PoS tags will often be rubbish. If PoS tags are wrong, or if sentence boundaries are wrong or missing, this will affect the resulting dependency trees – and so on. Most importantly, however, it may eventually affect your actual analysis.

Ultimately, it may be worthwhile to invest some time in improving “boring” or “solved” pre-processing and annotation steps.

Tokenisation

We can now tokenise the text. This time, we pass an additional argument to annotate(): the annotations to start with, namely sentence_annotation.

token_annotator <- Maxent_Word_Token_Annotator()
token_annotation <- annotate(txt,
                             token_annotator,
                             sentence_annotation)
token_annotation
 id  type     start end features
   1 sentence     1  29 constituents=<<integer,7>>
   2 sentence    31  80 constituents=<<integer,11>>
   3 sentence    82 106 constituents=<<integer,6>>
   4 sentence   108 114 constituents=<<integer,3>>
   5 sentence   116 132 constituents=<<integer,5>>
   6 sentence   134 200 constituents=<<integer,17>>
   7 sentence   202 289 constituents=<<integer,15>>
   8 sentence   291 313 constituents=<<integer,7>>
   9 sentence   315 335 constituents=<<integer,6>>
  10 sentence   337 397 constituents=<<integer,15>>
  11 word         1   2 
  12 word         4   8 
  13 word        10  15 
  14 word        17  21 
  15 word        22  22 
  16 word        24  28 
  17 word        29  29 
  18 word        31  39 
  19 word        41  44 
  20 word        46  47 
  21 word        49  53 
  22 word        55  55 
  23 word        57  61 
  24 word        63  67 
  25 word        69  70 
  26 word        72  72 
  27 word        74  79 
  28 word        80  80 
  29 word        82  85 
  30 word        87  91 
  31 word        93  95 
  32 word        97 101 
  33 word       103 105 
  34 word       106 106 
  35 word       108 109 
  36 word       111 113 
  37 word       114 114 
  38 word       116 121 
  39 word       123 125 
  40 word       127 127 
  41 word       129 131 
  42 word       132 132 
  43 word       134 137 
  44 word       139 143 
  45 word       145 146 
  46 word       148 148 
  47 word       150 150 
  48 word       152 155 
  49 word       157 160 
  50 word       162 163 
  51 word       165 167 
  52 word       169 170 
  53 word       172 174 
  54 word       176 180 
  55 word       182 185 
  56 word       187 190 
  57 word       192 194 
  58 word       196 199 
  59 word       200 200 
  60 word       202 205 
  61 word       207 208 
  62 word       210 211 
  63 word       213 217 
  64 word       219 220 
  65 word       222 228 
  66 word       230 234 
  67 word       236 247 
  68 word       249 251 
  69 word       253 260 
  70 word       261 262 
  71 word       264 268 
  72 word       270 286 
  73 word       287 288 
  74 word       289 289 
  75 word       291 291 
  76 word       293 295 
  77 word       297 299 
  78 word       301 302 
  79 word       304 308 
  80 word       310 312 
  81 word       313 313 
  82 word       315 315 
  83 word       317 317 
  84 word       319 325 
  85 word       327 330 
  86 word       332 334 
  87 word       335 335 
  88 word       337 339 
  89 word       341 341 
  90 word       343 344 
  91 word       345 347 
  92 word       349 353 
  93 word       355 356 
  94 word       358 361 
  95 word       363 364 
  96 word       366 368 
  97 word       369 369 
  98 word       371 372 
  99 word       374 378 
 100 word       380 385 
 101 word       387 390 
 102 word       392 397 

Let’s have a look at the actual tokens:

txt[token_annotation[token_annotation$type == "word"]] |> head(18)
 [1] "Go"        "until"     "jurong"    "point"     ","         "crazy"    
 [7] "."         "Available" "only"      "in"        "bugis"     "n"        
[13] "great"     "world"     "la"        "e"         "buffet"    "."        

PoS-Tagging

To add part-of-speech tags, we proceed in the same manner:

pos_tagger <- Maxent_POS_Tag_Annotator()
annotation <- annotate(txt, pos_tagger, token_annotation)
annotation
 id  type     start end features
   1 sentence     1  29 constituents=<<integer,7>>
   2 sentence    31  80 constituents=<<integer,11>>
   3 sentence    82 106 constituents=<<integer,6>>
   4 sentence   108 114 constituents=<<integer,3>>
   5 sentence   116 132 constituents=<<integer,5>>
   6 sentence   134 200 constituents=<<integer,17>>
   7 sentence   202 289 constituents=<<integer,15>>
   8 sentence   291 313 constituents=<<integer,7>>
   9 sentence   315 335 constituents=<<integer,6>>
  10 sentence   337 397 constituents=<<integer,15>>
  11 word         1   2 POS=VB
  12 word         4   8 POS=IN
  13 word        10  15 POS=JJ
  14 word        17  21 POS=NN
  15 word        22  22 POS=,
  16 word        24  28 POS=JJ
  17 word        29  29 POS=.
  18 word        31  39 POS=JJ
  19 word        41  44 POS=RB
  20 word        46  47 POS=IN
  21 word        49  53 POS=NN
  22 word        55  55 POS=RB
  23 word        57  61 POS=JJ
  24 word        63  67 POS=NN
  25 word        69  70 POS=DT
  26 word        72  72 POS=NN
  27 word        74  79 POS=NN
  28 word        80  80 POS=.
  29 word        82  85 POS=NNP
  30 word        87  91 POS=RB
  31 word        93  95 POS=VBD
  32 word        97 101 POS=RB
  33 word       103 105 POS=NN
  34 word       106 106 POS=.
  35 word       108 109 POS=NN
  36 word       111 113 POS=JJ
  37 word       114 114 POS=.
  38 word       116 121 POS=VBG
  39 word       123 125 POS=IN
  40 word       127 127 POS=PRP
  41 word       129 131 POS=NNS
  42 word       132 132 POS=.
  43 word       134 137 POS=JJ
  44 word       139 143 POS=NN
  45 word       145 146 POS=IN
  46 word       148 148 POS=CD
  47 word       150 150 POS=DT
  48 word       152 155 POS=RB
  49 word       157 160 POS=JJ
  50 word       162 163 POS=TO
  51 word       165 167 POS=VB
  52 word       169 170 POS=NNP
  53 word       172 174 POS=NNP
  54 word       176 180 POS=JJ
  55 word       182 185 POS=NNS
  56 word       187 190 POS=JJ
  57 word       192 194 POS=NNP
  58 word       196 199 POS=CD
  59 word       200 200 POS=.
  60 word       202 205 POS=NNP
  61 word       207 208 POS=VBD
  62 word       210 211 POS=TO
  63 word       213 217 POS=CD
  64 word       219 220 POS=TO
  65 word       222 228 POS=VB
  66 word       230 234 POS=NN
  67 word       236 247 POS=NN
  68 word       249 251 POS=IN
  69 word       253 260 POS=NNP
  70 word       261 262 POS=POS
  71 word       264 268 POS=VB
  72 word       270 286 POS=IN
  73 word       287 288 POS=PRP
  74 word       289 289 POS=.
  75 word       291 291 POS=NNP
  76 word       293 295 POS=NN
  77 word       297 299 POS=VBP
  78 word       301 302 POS=RB
  79 word       304 308 POS=JJ
  80 word       310 312 POS=NN
  81 word       313 313 POS=.
  82 word       315 315 POS=NNP
  83 word       317 317 POS=NN
  84 word       319 325 POS=RB
  85 word       327 330 POS=RB
  86 word       332 334 POS=VBP
  87 word       335 335 POS=.
  88 word       337 339 POS=NNP
  89 word       341 341 POS=PRP
  90 word       343 344 POS=VBP
  91 word       345 347 POS=RB
  92 word       349 353 POS=VB
  93 word       355 356 POS=PRP
  94 word       358 361 POS=VBZ
  95 word       363 364 POS=TO
  96 word       366 368 POS=RP
  97 word       369 369 POS=,
  98 word       371 372 POS=PRP
  99 word       374 378 POS=VBZ
 100 word       380 385 POS=RB
 101 word       387 390 POS=RB
 102 word       392 397 POS=IN

We can now work with the annotated text – or first transform it to another format.

tokens <- txt[annotation[annotation$type == "word"]]

# annotation$features is a list of lists which makes it a little
# difficult to handle:
# annotation$features[annotation$type == "word"]

# We can use unlist() to collapse list elements into a vector:
pos_sequence <- as.character(unlist(
  annotation$features[
    annotation$type == "word"
  ]
))

# We could also transform it to a data.frame:
# annotation |> as.data.frame() |> filter(type == "word")

# Or we could construct a tibble out of tokens and PoS tags
# (although it would probably be nice to further add sentence and
# text IDs):
tibble(ID = 1:length(tokens), Token = tokens, POS = pos_sequence)

Package: corpustools

corpustools is a package designed to manage, query and analyse tokenised text. For advanced pre-processing (e.g. part-of-speech tagging or parsing), it relies on other packages, such as spacyr, coreNLP or udpipe.

There’s a package vignette that provides a nice overview: https://cran.r-project.org/web/packages/corpustools/vignettes/corpustools.html

We’ll first create a corpus from a data.frame. Because every document needs a unique ID, we’ll have to add a new column to our SMS dataset:

smsspam_df <- data.frame(smsspam) # copy
smsspam_df$id <- sprintf("doc_%05d", 1:nrow(smsspam)) # new column
corpus <- create_tcorpus(smsspam_df,
                         doc_column = 'id',
                         text_columns = 'text',
                         split_sentences = TRUE,
                         verbose = FALSE) # suppress progress bar
corpus
tCorpus containing 108985 tokens
grouped by documents (n = 5574) and sentences (n = 11467)
contains:
  - 4 columns in $tokens:   doc_id, sentence, token_id, token
  - 2 columns in $meta:     doc_id, y

Let’s look at the tokens:

head(corpus$tokens)

And the metadata:

head(corpus$meta)

We can also import pre-processed tokens. We’ll use the sample data corenlp_tokens to demonstrate this.

head(corenlp_tokens)
tc = tokens_to_tcorpus(corenlp_tokens,
                       doc_col = 'doc_id',
                       sentence_col = 'sentence',
                       token_id_col = 'id')
tc
tCorpus containing 36 tokens
grouped by documents (n = 1) and sentences (n = 6)
contains:
  - 13 columns in $tokens:  doc_id, sentence, token_id, token, lemma, CharacterOffsetBegin, CharacterOffsetEnd, POS, NER, Speaker, parent, relation, pos1
  - 1 column in $meta:      doc_id

Subsetting

We can use subset() to filter our corpus (see ?subset.tCorpus).

Use the parameter subset to specify which rows to keep in the tokens data:

sent347 <- subset(corpus, subset = sentence == 2)
sent347
tCorpus containing 27648 tokens
grouped by documents (n = 2996) and sentences (n = 2996)
contains:
  - 4 columns in $tokens:   doc_id, sentence, token_id, token
  - 2 columns in $meta:     doc_id, y

Use the parameter subset_meta to specify which documents to keep:

doc50 <- subset(corpus, subset_meta = doc_id == "doc_00050")
doc50$tokens
first_spam <- subset(corpus,
                subset = token_id == 1,
                subset_meta = y == "spam")
first_spam$tokens |> select(doc_id, token)

Pre-processing

An object of the class tCorpus (like our corpus) has quite a few class methods we can use (have a look at str(corpus)).

Keep in mind that these class methods usually directly modify the corpus object. If you want to keep the original object, use tCorpus$copy() first:

corpus_old <- corpus$copy()

Arguably the most important class method is tCorpus$preprocess() which allows us to apply typical pre-processing steps to the whole corpus:

corpus$preprocess(column = 'token',
                  new_column = 'token2',
                  lowercase = TRUE,
                  remove_punctuation = FALSE,
                  remove_stopwords = FALSE,
                  remove_numbers = FALSE,
                  use_stemming = FALSE,
                  language = "english")
corpus$tokens

Deduplication

We can remove identical or very similar documents from our corpus:

nrow(corpus$meta)
[1] 5574
corpus$deduplicate(feature='token2',
                   similarity = 0.95, # very similar documents
                   print_duplicates = FALSE) # set to TRUE if you want the IDs of removed documents
as(<dgTMatrix>, "dgCMatrix") is deprecated since Matrix 1.5-0; do as(., "CsparseMatrix") instead
Deleting 521 duplicates
nrow(corpus$meta)
[1] 5053

Document feature matrix

A document-term matrix is a matrix where each row represents a single document and each column represents a feature (e.g. a token). The values in the matrix represent the frequency of a given term in a given document.

A document-term matrix can be used to tell us something about the content or style of a document. This information can then be used to retrieve relevant documents, to group similar documents by topic or to compare the style of different authors.

Depending on the task at hand, the procedure will differ, but the basic principle remains the same.
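To make the idea concrete, here’s a minimal sketch with two toy documents (hypothetical examples, not part of the SMS data), using nothing but base R:

```r
# Two toy documents, already tokenised
toy_tokens <- list(d1 = c("free", "prize", "call", "now"),
                   d2 = c("call", "me", "later"))

# Cross-tabulate documents against terms:
# rows = documents, columns = terms, cells = term frequencies
dtm <- table(doc = rep(names(toy_tokens), lengths(toy_tokens)),
             term = unlist(toy_tokens))
dtm
```

Every term that occurs anywhere in the collection gets its own column, which is why real document-term matrices grow so wide (and so sparse) so quickly.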

In many use cases, it makes sense to remove certain words (e.g. function words or very infrequent words) before constructing a document-term matrix. (On the other hand, sometimes function words can be very interesting, for example in stylometry. So think before thinning out your corpus!)

corpus$preprocess(column = 'token',
                  new_column = 'feature',
                  remove_stopwords = TRUE,
                  remove_numbers = TRUE,
                  use_stemming = TRUE,
                  min_docfreq = 5)

dfm <- get_dfm(corpus, 'feature')
dfm
Document-feature matrix of: 5,053 documents, 1,384 features (99.50% sparse) and 1 docvar.
           features
docs        10am 10p 12hrs 150ppm 150p 1st 2land 2mrw 2nite 2day
  doc_00001    0   0     0      0    0   0     0    0     0    0
  doc_00002    0   0     0      0    0   0     0    0     0    0
  doc_00003    0   0     0      0    0   0     0    0     0    0
  doc_00004    0   0     0      0    0   0     0    0     0    0
  doc_00005    0   0     0      0    0   0     0    0     0    0
  doc_00006    0   0     0      0    0   0     0    0     0    0
[ reached max_ndoc ... 5,047 more documents, reached max_nfeat ... 1,374 more features ]

You can see that most values in the matrix are zero. A matrix like this is called a sparse matrix. The percentage of zero-valued elements is often called sparsity – our matrix here is 99.5% sparse, so only 0.5% of elements are non-zero. Specialised algorithms and data structures can take advantage of sparsity, while “regular” ones used on dense matrices can be slow and inefficient as they waste time and memory on all those zeros.
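The memory difference between dense and sparse storage can be illustrated with the Matrix package (the dimensions below are made up to roughly match our dfm):

```r
library(Matrix)

set.seed(1)
n_docs <- 5000; n_feats <- 1400
n_nonzero <- round(n_docs * n_feats * 0.005) # ~0.5% non-zero entries

# scatter the non-zero counts over random positions
sparse_m <- sparseMatrix(i = sample(n_docs, n_nonzero, replace = TRUE),
                         j = sample(n_feats, n_nonzero, replace = TRUE),
                         x = 1, dims = c(n_docs, n_feats))
dense_m <- as.matrix(sparse_m)

object.size(dense_m)  # stores every cell, zeros included
object.size(sparse_m) # stores only the non-zero entries
```

The sparse representation needs only a small fraction of the memory, which is exactly what packages like quanteda exploit internally.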

The raw frequencies seen above can also be weighted differently. The best procedure depends on what you want to do!

A common approach is term frequency–inverse document frequency (tf-idf). Term frequency is what we’ve got so far: how often a token (or term) occurs in a document. (A number of adjustments are possible, e.g. for document length.)
Document frequency, on the other hand, is how many documents contain a certain token (term) – regardless of how often it occurs within individual documents. Inverse document frequency relates this to the total number of documents (N / df): the smaller this value, the more common the token across documents. We usually take the natural logarithm of this quotient.
Finally, tf-idf is the product of term frequency and inverse document frequency. Idf therefore functions as a weight for the term frequency: the less common a token is across documents, the more important it is in a document in which it occurs. And vice versa, the more common a token generally is (think of words like is, and, the, people etc.), the less important it is in a single document, even if it occurs very frequently.
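As a sanity check, tf-idf can be computed by hand for a few toy documents (again hypothetical, not the SMS data):

```r
docs <- list(d1 = c("free", "prize", "call"),
             d2 = c("call", "me", "later"),
             d3 = c("free", "call", "now"))
terms <- sort(unique(unlist(docs)))

# term frequencies: rows = terms, one column per document
tf <- sapply(docs, function(d) table(factor(d, levels = terms)))

df  <- rowSums(tf > 0)        # document frequency of each term
idf <- log(length(docs) / df) # idf = ln(N / df)
tfidf <- tf * idf             # idf recycles down the rows (terms)

round(tfidf, 2)
# "call" occurs in every document: idf = ln(3/3) = 0, so its tf-idf is 0
```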

dfm_weighted <- get_dfm(corpus, 'feature', weight = 'tfidf')
dfm_weighted
Document-feature matrix of: 5,053 documents, 1,384 features (99.50% sparse) and 1 docvar.
           features
docs        10am 10p 12hrs 150ppm 150p 1st 2land 2mrw 2nite 2day
  doc_00001    0   0     0      0    0   0     0    0     0    0
  doc_00002    0   0     0      0    0   0     0    0     0    0
  doc_00003    0   0     0      0    0   0     0    0     0    0
  doc_00004    0   0     0      0    0   0     0    0     0    0
  doc_00005    0   0     0      0    0   0     0    0     0    0
  doc_00006    0   0     0      0    0   0     0    0     0    0
[ reached max_ndoc ... 5,047 more documents, reached max_nfeat ... 1,374 more features ]

When tidytext is loaded, you can use tidy() to convert a document-feature matrix to a tibble. This might make it easier to get a sense of the data.

dfm_weighted |> tidy() |> arrange(desc(count))

This should probably tell us that more filtering before creating the DTM and/or better pre-processing and tokenisation might be a good idea. :)

Search the corpus

query_result <- search_features(corpus,
                                feature = "token",
                                query = c('call', 'phone'))
table(as.character(query_result$hits$feature))

 call  Call  CALL phone Phone PHONE 
  296   118    21    94     7     4 

Visualise results

queries <- data.frame(label = c('call', 'email', 'meet'),
                      query = c('call* OR phone',
                                'email OR e-mail',
                                'meet* OR see*'))

hits <- search_features(corpus,
                        query = queries$query,
                        code = queries$label)

count_tcorpus(corpus, hits = hits)
category_hits = count_tcorpus(corpus,
                              hits,
                              meta_cols = 'y',
                              wide = FALSE)
  
ggplot(category_hits, aes(x = y, y = count, fill = code)) +
  geom_col(position = "dodge") # geom_col() = geom_bar(stat = "identity")

Display results

If we’re also interested in seeing results in context, we can use browse_hits() to create a static HTML page of our hits or get_kwic() to get a data.frame with a KWIC (keyword in context) column.

url = browse_hits(corpus, hits, view = TRUE)
get_kwic(corpus, query = "call*", n = 2)

Package: tidytext

tidytext is designed to work with tidy text data, meaning one token per row. Some of the structures we’ve seen above don’t conform to this, so tidytext offers the tidy() function to tidy data from other packages and to convert tidy data into other formats expected by these packages.

Create a tibble out of a tCorpus object:

sms_tidy <- as_tibble(corpus$tokens) |> 
  inner_join(corpus$meta, by = 'doc_id') |>
  mutate(word = token2)
sms_tidy

Looking at the feature column, we can see that the stemming we did earlier looks pretty horrible.

Alternatively, we could apply the tidytext function unnest_tokens() to the raw texts to get a similar result:

smsspam |>
  as_tibble() |>
  mutate(text_id = 1:n()) |> # add text ids
  unnest_tokens(output = sentence, input = text, token = "sentences") |> # split sentences
  group_by(text_id) |> # group by text for next step
  mutate(sentence_id = 1:n()) |> # add sentence ids
  ungroup() |> # no more grouping for next step
  unnest_tokens(output = token, input = sentence, token = "words", drop = FALSE, strip_punct = FALSE) |> # tokenise each sentence, keep other columns
  group_by(text_id, sentence_id) |> # group for next step (could also just group by text id)
  mutate(token_id = 1:n()) |>
  ungroup()

Counting words

sms_tidy |> count(word, sort = TRUE)

Removing stop words

Stop word lists are far from ideal, but sometimes, they are enough. get_stopwords() gives us a lexicon of words commonly excluded from further analysis. We can then do an anti_join() to remove them from our data.

sms_clean <- sms_tidy |> 
  anti_join(tidytext::get_stopwords("en"))
Joining, by = "word"
nrow(sms_clean) # number of rows now
[1] 66290
nrow(sms_tidy) # number of rows before
[1] 96537
sms_clean |> count(word, sort = TRUE)

We’ve still got punctuation and very short words in there, so let’s remove all that as well:

sms_clean <- sms_clean |> filter(str_length(word) > 2)
sms_clean |> count(word, sort = TRUE)

Word clouds

Let’s give wordcloud a try this time.

sms_clean |> count(word) |> with(wordcloud(word, n, max.words=50))

Of course, ggwordcloud works just as well – and allows us to group the words by classification (spam or not).

set.seed(1)
sms_clean |>
  group_by(y) |>
  count(word) |>
  arrange(y, desc(n)) |>
  summarise(head(across(), 100)) |>
  ungroup() |>
  mutate(angle = 90 * sample(c(0, 1), n(),
                             replace = TRUE, prob = c(60, 40))) |>
  ggplot(aes(label = word, size = n,
             colour = y, x = y,
             angle = angle)) +
  geom_text_wordcloud(area_corr = TRUE) +
  scale_size_area(max_size = 12) +
  labs(title = "Most frequent words", x = "Classification") +
  theme_minimal() +
  theme(panel.grid.major = element_blank())
`summarise()` has grouped output by 'y'. You can override using the `.groups`
argument.

Document-feature matrix

Instead of using the document-term matrix from above, we can also easily create a new one without using the corpustools package.

tidy_dtm <- sms_clean |>
  group_by(doc_id, word) |>
  summarise(count = n()) |>
  ungroup() |>
  arrange(doc_id, word)
`summarise()` has grouped output by 'doc_id'. You can override using the
`.groups` argument.
rare_terms <- sms_clean |>
  count(word) |>
  filter(n < 4) |>
  arrange(desc(n))

tidy_dtm <- tidy_dtm |>
  anti_join(rare_terms) # remove rare terms
Joining, by = "word"
tidy_dtm

To get a proper matrix, we can convert a tibble like this using cast_dfm(), specifying which columns in our data contain document identifiers, terms and frequencies.

tidy_dtm |>
  mutate(word = as.character(word)) |> # next function doesn't like factors
  cast_dfm(doc_id, word, count)
Document-feature matrix of: 4,957 documents, 1,740 features (99.65% sparse) and 0 docvars.
           features
docs        available bugis cine crazy got great world wat point joking
  doc_00001         1     1    1     1   1     1     1   1     1      0
  doc_00002         0     0    0     0   0     0     0   0     0      1
  doc_00003         0     0    0     0   0     0     0   0     0      0
  doc_00004         0     0    0     0   0     0     0   0     0      0
  doc_00005         0     0    0     0   0     0     0   0     0      0
  doc_00006         0     0    0     0   0     0     0   0     0      0
[ reached max_ndoc ... 4,951 more documents, reached max_nfeat ... 1,730 more features ]

We can also use cast_dtm() to get a document-feature matrix in the DocumentTermMatrix format from the tm (text mining) package.

tidy_dtm |>
  mutate(word = as.character(word)) |> # next function doesn't like factors
  cast_dtm(doc_id, word, count)
<<DocumentTermMatrix (documents: 4957, terms: 1740)>>
Non-/sparse entries: 30542/8594638
Sparsity           : 100%
Maximal term length: 17
Weighting          : term frequency (tf)

Both functions have the optional parameter weighting […]

For tf-idf, there’s also the function bind_tf_idf() which can be applied to a tidy tibble with one row per token (or term), per document (such as tidy_dtm):

tidy_dtm |> bind_tf_idf(word, doc_id, count)

Topic modelling

Topic modelling extracts a number of topics from a collection of texts; here, we’ll use Latent Dirichlet Allocation (LDA).

The required input for LDA() is a document-term matrix in the DocumentTermMatrix format from the tm (text mining) package.

sms_lda <- tidy_dtm |>
  mutate(word = as.character(word)) |> # next function doesn't like factors
  cast_dtm(doc_id, word, count)  |>
  LDA(k = 4)
topics <- tidy(sms_lda, matrix = "beta")
topics

Let’s see what the top 20 terms for each topic are:

top_terms <- topics |>
  group_by(topic) |>
  slice_max(beta, n = 20) |>
  ungroup() |>
  arrange(topic, -beta)

top_terms |>
  mutate(term = reorder_within(term, beta, topic)) |>
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free", ncol = 4) +
  scale_y_reordered()

Honestly, we can’t make much sense of this. We could filter more to get a better input for the LDA, but it may also be that topic modelling just doesn’t work well for such short messages.

We’ll give it another try with some CMC data from Twitter and Reddit. Since we’ve got part-of-speech tags in this case, we’ll only keep certain content words for our analysis – no need for stop word lists!

cmc <- read_csv2("../data/cmc-sample.csv")
ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.
Rows: 104128 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (7): token, pos, lemma, lemma_ab, lemma_nd, comment_ab, source
dbl (1): id
lgl (1): comment_nd

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cmc$textid <- str_extract(cmc$token, "(?<=<text_id )[[:alnum:]]+")
cmc <- cmc |> fill(textid, .direction = "downup")

# new column: year/month
cmc$year_month <- str_extract(cmc$token, "(?<=<text_ym )[0-9]+")
cmc <- cmc |> fill(year_month, .direction = "downup")

# new column: sentence id
cmc <- cmc |> mutate(sid = cumsum(str_detect(token, "<s>")))

# no more XML tags
cmc <- cmc |> filter(!is.na(pos))

# new column: extract year from year_month
cmc$year <- str_extract(cmc$year_month, "[0-9]{4}")

# replace HTML entities
cmc <- cmc |> mutate(token = str_replace(token, "&apos;", "'"),
                      token = str_replace(token, "&quot;", '"'),
                      token = str_replace(token, "&amp;", "&"),
                      token = str_replace(token, "&lt;", "<"),
                      token = str_replace(token, "&gt;", ">"),
                      token = str_replace(token, "&#039;", "'"),
                      token = str_replace(token, "&#x25;", "%"))

tags <- c("ADJD", "ADJA", "NN", "NE", "TRUNC", "HST") # c("ADJD", "ADJA", "NN", "NE", "TRUNC", "VVFIN", "VVIMP", "VVINF", "VVIZU", "VVPP", "HST")

cmc_content <- cmc |>
  filter(pos %in% tags,
         str_length(lemma) > 1) |>
  select(source, textid, sid, tid = id, token, lemma = lemma_ab,
         pos, year_month, year)
cmc_content
cmc_dtm <- cmc_content |>
  group_by(textid) |>
  count(lemma) |>
  ungroup() |>
  arrange(textid, lemma)

cmc_rare_terms <- cmc_content |>
  count(lemma) |>
  filter(n < 4) |>
  arrange(desc(n))

cmc_dtm <- cmc_dtm |>
  anti_join(cmc_rare_terms, by = "lemma")

cmc_dtm <- cmc_dtm |> bind_tf_idf(lemma, textid, n)

cmc_dtm <- cmc_dtm |> filter(tf < .5, tf_idf >= .3)

cmc_dtm
cmc_lda <- cmc_dtm |>
  cast_dtm(textid, lemma, n)  |>
  LDA(k = 12, control = list(nstart = 50)) # 50 repeated runs with random initialisations
  # LDA(k = 12, method = "Gibbs", control = list(seed = 42, # setting a seed makes the result reproducible
  #                                             iter = 3000,
  #                                             thin = 10,
  #                                             burnin = 1000)) 

topics <- tidy(cmc_lda, matrix = "beta")
topics
top_terms <- topics |>
  group_by(topic) |>
  slice_max(beta, n = 20) |>
  ungroup() |>
  arrange(topic, -beta)

top_terms |>
  mutate(term = reorder_within(term, beta, topic)) |>
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  scale_y_reordered() +
  labs(x = expression(beta), y = "Lemma")

In contrast to the topics from the SMS corpus, these topics are at least interpretable and partly coincide with our own observations (e.g. Brexit being a popular topic).

However, the topics are relatively unstable, so that you often get somewhat different topics in different runs. A higher value of nstart (for more repeated runs) may increase the topic stability, but will also greatly increase the execution time.

Topic modelling also works much better for collections of longer documents.

Which lemmas in individual documents are assigned to which topics?

augment(cmc_lda, cmc_dtm |> rename(document = textid, term = lemma)) |>
  select(document, term, .topic, everything())

We can also look at the estimated proportions of terms from each document belonging to each topic:

cmc_gamma <- tidy(cmc_lda, matrix = "gamma") |>
  arrange(document, desc(gamma))
cmc_gamma

Are there differences between Reddit and Twitter regarding the estimated proportions?

document_source <- cmc |>
  select(textid, source, year) |>
  rename(document = textid) |>
  distinct()
cmc_gamma <- cmc_gamma |>
  left_join(document_source)
Joining, by = "document"
cmc_gamma |>
  ggplot(aes(x = factor(topic), y = gamma, colour = source)) +
  geom_boxplot(outlier.alpha = .1) +
  labs(x = "Topic", y = expression(gamma), colour = "Source")

Sentiment analysis

Sentiment analysis, or opinion mining, in its simplest form is used to assess whether a text (or a section of a text) is positive or negative (e.g. product reviews). More sophisticated forms of sentiment analysis are used to assess a wider range of emotional content (surprise, anger, love, disgust, …).

An easy (but crude) approach to sentiment analysis would be to use a sentiment lexicon which assigns sentiment labels to specific tokens (e.g. idiotic => negative sentiment).

sentiment_scores <- tidytext::get_sentiments(lexicon = "bing")
positive <- sentiment_scores |> filter(sentiment == "positive")
negative <- sentiment_scores |> filter(sentiment == "negative")

sms_tidy |> 
  filter(y == "ham") |>
  semi_join(positive) |>
  count(word, sort = TRUE)
Joining, by = "word"
sms_tidy |> 
  filter(y == "spam") |>
  semi_join(positive) |>
  count(word, sort = TRUE)
Joining, by = "word"
sms_tidy |>
  inner_join(sentiment_scores) |>
  ggplot(aes(x = y, fill = sentiment)) +
  geom_bar(position = "dodge")
Joining, by = "word"

Further reading and exercises